feat: add native implementations of regexp_extract and regexp_extract_all #4146
andygrove wants to merge 6 commits into
Implement regexp_extract using the Rust regex crate. The expression is marked Incompatible because the Rust regex engine differs from the Java engine that Spark uses; users must opt in via spark.comet.expression.RegExpExtract.allowIncompatible=true.
Audit follow-ups:
- Align Rust error messages with Spark's `INVALID_PARAMETER_VALUE` templates so `expect_error` substrings can match both engines.
- Override `getUnsupportedReasons` in `CometRegExpExtract` so the non-literal pattern and non-literal idx reasons are picked up by the Compatibility Guide generator.
- Add Comet SQL test cases for: NULL pattern and NULL idx, idx=0 with no capture groups, multibyte / Unicode subjects, idx out of range, pattern with no groups + idx>=1, negative idx, invalid regex syntax, and a Java-only lookahead that Rust regex rejects (marked `ignore`).
- Add fallback test cases for non-literal pattern and non-literal idx.
- Mark the expression supported in `spark_expressions_support.md` with per-version audit notes.
Address review feedback:
- Make `extract_array` build a `GenericStringBuilder<O>` matching the input offset size so a `LargeUtf8` subject no longer silently outputs `Utf8` (avoids potential i32-offset overflow on >2GB inputs).
- Inline group extraction so the per-row `String` allocation is gone; the only remaining `to_string` is on the rare scalar code path.
- Replace the manual append-null loop in `null_result` with `StringArray::new_null(n)`.
- Borrow the pattern as `&str` instead of cloning it before calling `Regex::new`.
- Pass `failOnError = false` to the proto, matching `CometStringSplit`. The Rust UDF does not branch on this flag, so `true` was misleading.
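The offset-size concern in the review feedback above can be sketched without Arrow at all. Arrow's `Utf8` layout stores 32-bit offsets and `LargeUtf8` stores 64-bit offsets, so the cumulative byte length of the output has to fit the offset type. The helper `to_offset` below is hypothetical and not taken from the PR; it only models the conversion that an offset-generic builder has to get right.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not from the PR): convert a cumulative byte length
/// into an offset of type `O`, returning `None` when it does not fit.
fn to_offset<O: TryFrom<u64>>(total_bytes: u64) -> Option<O> {
    O::try_from(total_bytes).ok()
}

fn main() {
    let three_gib: u64 = 3 * 1024 * 1024 * 1024;
    // i32 offsets (the Utf8 layout) cannot address 3 GiB of string data.
    assert!(to_offset::<i32>(three_gib).is_none());
    // i64 offsets (the LargeUtf8 layout) can.
    assert_eq!(to_offset::<i64>(three_gib), Some(3_221_225_472));
}
```

Keeping the output offset type equal to the input offset type means an input that needed 64-bit offsets is never narrowed to 32-bit ones.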
The test data has no duplicate rows, so the parquet.enable.dictionary matrix produces two identical runs.
Adds a native Rust UDF `spark_regexp_extract_all` and a `CometRegExpExtractAll` serde, paralleling the existing `regexp_extract` support. Returns `List<Utf8>` containing the matched group across every non-overlapping match of the pattern. Reported as Incompatible because the Rust regex engine differs from Java's; gated on `spark.comet.expression.RegExpExtractAll.allowIncompatible=true`. Falls back when the pattern or `idx` is non-literal.
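To make "every non-overlapping match" concrete, here is a std-only illustration. It uses a literal substring via `str::match_indices` as a stand-in for a regex, so it says nothing about how `spark_regexp_extract_all` is actually implemented; the function name `all_matches` is hypothetical.

```rust
/// Hypothetical stand-in: collect every non-overlapping occurrence of a
/// literal `needle`, scanning left to right.
fn all_matches(subject: &str, needle: &str) -> Vec<String> {
    subject
        .match_indices(needle)
        .map(|(_, matched)| matched.to_string())
        .collect()
}

fn main() {
    // "aaaa" contains "aa" at byte offsets 0, 1, and 2, but only the
    // occurrences at 0 and 2 are non-overlapping.
    assert_eq!(all_matches("aaaa", "aa"), vec!["aa", "aa"]);
    // In this stand-in, no match yields an empty Vec.
    assert!(all_matches("abc", "x").is_empty());
}
```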
# Conflicts:
# spark/src/main/scala/org/apache/comet/serde/strings.scala
Thanks for tackling these @andygrove, the support-level handling and the SQL test shape (including the explicit A few things I wanted to ask about: Could we lift this into a
Which issue does this PR close?
Closes #2708
Rationale for this change
`regexp_extract` and `regexp_extract_all` are common Spark SQL string functions used to pull substrings out of input strings via regex capture groups. Adding native support lets queries that use them stay in Comet instead of falling back to Spark.

What changes are included in this PR?
regexp_extract:
- `spark_regexp_extract` in `native/spark-expr/src/string_funcs/regexp_extract.rs`, backed by the `regex` crate. Handles Utf8 and LargeUtf8 inputs (array and scalar), idx defaults to 1, idx=0 returns the whole match, no match returns the empty string, an unmatched optional group returns the empty string, null input returns null, and an out-of-range idx returns an execution error.
- `regexp_extract` in `comet_scalar_funcs.rs`.
- `CometRegExpExtract` serde mapping `RegExpExtract` to the native UDF. Reported as `Incompatible` because the Rust regex engine has different semantics from Java's regex engine (POSIX classes, look-around, possessive quantifiers, etc.). Users opt in via `spark.comet.expression.RegExpExtract.allowIncompatible=true`. Falls back when the pattern or `idx` is non-literal.

regexp_extract_all:
- `spark_regexp_extract_all` in `native/spark-expr/src/string_funcs/regexp_extract_all.rs`, paralleling the `regexp_extract` UDF. Returns `List<Utf8>` containing the selected capture group across every non-overlapping match of the pattern.
- `regexp_extract_all` in `comet_scalar_funcs.rs`.
- `CometRegExpExtractAll` serde mapping `RegExpExtractAll` to the native UDF. Also reported as `Incompatible` for the same regex-engine reasons; gated on `spark.comet.expression.RegExpExtractAll.allowIncompatible=true`. Falls back when the pattern or `idx` is non-literal.

How are these changes tested?
- `regexp_extract.rs` covering basic group extraction, idx=0/default idx, null subject, null pattern, unmatched optional group, out-of-range idx, negative idx, and invalid regex.
- `regexp_extract_all.rs` covering the analogous cases for the array-returning variant.
- `regexp_extract.sql` / `regexp_extract_all.sql` verify that each expression falls back to Spark by default.
- `regexp_extract_enabled.sql` / `regexp_extract_all_enabled.sql` exercise the happy paths with `allowIncompatible=true`, including default and explicit idx, idx=0, no-match, null input, optional unmatched groups, anchors, and all-literal expressions. Run under a `ConfigMatrix` for both dictionary-encoded and plain Parquet input.
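
The per-row `regexp_extract` cases these tests cover (idx=0, no match, unmatched optional group, out-of-range and negative idx) can be modeled in a few lines. This is a sketch under assumptions, not the PR's code: `select_group` is a hypothetical name, the captures are passed in precomputed so the example needs no external crate, the error text is made up, and the idx check is assumed to run before the match result is consulted.

```rust
/// Hypothetical model of the group-selection rules listed in the PR description.
/// `captures` is `None` when the pattern did not match; otherwise slot 0 is
/// the whole match and slot i is capture group i (`None` if that group did
/// not participate). `group_count` is the number of groups in the pattern.
fn select_group(
    captures: Option<&[Option<&str>]>,
    group_count: usize,
    idx: i32,
) -> Result<String, String> {
    // Assumption: a negative or too-large idx is rejected before the match result is used.
    if idx < 0 || idx as usize > group_count {
        return Err(format!("idx {} is out of range 0..={}", idx, group_count));
    }
    let selected = match captures {
        // No match: empty string.
        None => "",
        // Matched: the requested group, or empty if it did not participate.
        Some(slots) => slots.get(idx as usize).copied().flatten().unwrap_or(""),
    };
    Ok(selected.to_string())
}

fn main() {
    // Captures as a regex like (\d+)-(\d+) would produce them for "100-200".
    let caps = [Some("100-200"), Some("100"), Some("200")];
    assert_eq!(select_group(Some(&caps[..]), 2, 1), Ok("100".to_string()));
    assert_eq!(select_group(Some(&caps[..]), 2, 0), Ok("100-200".to_string()));
    assert!(select_group(Some(&caps[..]), 2, 3).is_err());
    assert!(select_group(Some(&caps[..]), 2, -1).is_err());
    // No match at all.
    assert_eq!(select_group(None, 2, 1), Ok(String::new()));
    // Captures as (a)(b)? would produce them for "a": group 2 is unmatched.
    let optional = [Some("a"), Some("a"), None];
    assert_eq!(select_group(Some(&optional[..]), 2, 2), Ok(String::new()));
}
```

A null subject is left out of the model; per the description it yields null rather than an empty string.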